CC-News-En: A Large English News Corpus

Joel Mackenzie; Rodger Benham; Matthias Petri; Johanne R Trippas; J Shane Culpepper; Alistair Moffat

Conference Proceedings

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management | ACM | Published: 2020

DOI: 10.1145/3340531.3412762

Abstract

We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, form..

University of Melbourne Researchers

Alistair Moffat (Author)

Related Projects (1)

NEW APPROACHES TO INTERACTIVE SESSIONAL SEARCH FOR COMPLEX TASKS

Grants

Awarded by Australian Research Council

Funding Acknowledgements

We thank Sebastian Nagel (Common Crawl) for providing useful information about the news crawl, and the Common Crawl organization for their commitment to providing open data. This work was partially supported by Australian Research Council Grants DP170102231, DP190101113, and DP200103136. The second author was supported by an RMIT VCPS.

Citation metrics

Scopus: 66

Web of Science: 30

Dimensions: 54

Keywords

4605 Data Management and Data Science

Technology

4609 Information Systems

Science & Technology

Computer Science

Computer Science, Information Systems

46 Information and Computing Sciences

Computer Science, Theory & Methods